A brief introduction to sparklyr

@TiffanyTimbers / @UBC

06/14/2019

What to do when code is slow?

Attribution: Javier Luraschi’s talk slides from SDSS 2019

Scaling up vs scaling out

source: https://hadoop4usa.wordpress.com/2012/04/13/scale-out-up/

Scaling out

MapReduce (Hadoop) was the original big kid on the block in terms of scaling out.

Reminder of how MapReduce works

source: https://www.edureka.co/blog/mapreduce-tutorial/
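
As a rough illustration of the map, shuffle, and reduce steps, here is a toy word count in plain Python. It runs in a single process and is only a sketch of the idea, not Hadoop's API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample input are made up for this example. In a real cluster, each input split would be mapped on a different worker and the grouped keys would be sent across the network to the reducers.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the list of values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of input splits."""
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: three small "splits" of text.
splits = ["Dear Bear River", "Car Car River", "Deer Car Bear"]
counts = word_count(splits)
print(counts)
```

Because each call to `map_phase` and `reduce_phase` only looks at its own split or key, the calls are independent of one another, which is what lets a framework spread them across many machines.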

Hadoop vs Spark

source: http://www.big-data.tips/apache-spark-vs-hadoop

Hadoop vs Spark

source: https://data-flair.training/blogs/spark-vs-hadoop-mapreduce/

Choice for scaling out

Spark’s gains in speed and ease of use mean there is now a faster and smoother kid on the block…

How/where can you use Spark?

source: Zaharia et al. (2016). Apache Spark: A Unified Engine For Big Data Processing

source: https://www.slideshare.net/SparkSummit/trends-for-big-data-and-apache-spark-in-2017-by-matei-zaharia

Leading Spark cloud platforms

sparklyr + Databricks demo

Notebook 1 - Install sparklyr on a Databricks cluster

Notebook 2 - Analysis demo

Full code from today’s demo

Notebook